This paper describes a model that generates on-off speech patterns
representative of those in experimental two-way telephone conversations.
The model assumes a conversant to occupy one of three speaking or one of
three silent states. Transitions among the states are determined by Poisson
processes governed by six parameters (one for each state). The validity of the
model is tested by comparing the model computer simulation of 16 conversations
with 16 real conversations. Cumulative distribution functions are
compared for ten events (such as talkspurts, pauses, mutual silences, and
so on) defined on the speech patterns. The model yields good fits to all
events except "speech before interruption;" when an interruption occurs,
a model speaker tends to interrupt the other's talkspurt later than a real
speaker does.

Theoretical behavior of the model is also studied. All events consist of 
concatenations of exponentially distributed "state durations," even though 
most events are not themselves exponential. For some purposes, the exponential
distribution is a satisfactory empirical fit to talkspurts, but not
to pauses. Possible applications of the model include studying people's 
motivations to talk and fall silent on different circuits, and predicting 
statistical behavior of voice operated devices on the circuits. 

I. INTRODUCTION 

1.1 Applications of the Model

A model for generating on-off speech patterns in two-way con- 
versations may have two uses: 

(i) It may provide insight on the dynamic processes which deter- 
mine when a person talks or is silent. For example, the model pro- 
posed here allows a person to be in one of six states, depending on 
whether he is talking, listening, or both conversants are talking, and 
so on. Each state is associated with a parameter which could be interpreted
as a "motivation" for either starting to talk or falling silent.
As a subject talks over different experimental conditions, changes in 
the "motivation parameters" might be correlated with changes in sub- 
jective opinion of the circuit. 

(ii) The model may predict the statistical behavior of voice-operated
devices (such as echo suppressors, voice-switched amplifiers) as the 
circuits are changed. One alternative to using a model is to have peo- 
ple talk over different circuits and study the circuit behavior. This is 
often unsatisfactory because too much data may be required to isolate 
the effects of a particular circuit change. Another alternative to a 
model is to record an experimental prototype conversation and then 
play it over different circuits. This is also usually unsatisfactory because
the conversants cannot react to circuit changes; their behavior
remains the same. A model has the advantage of keeping the statistical 
structure of the "conversants" unchanged while allowing them to re- 
act as the circuit parameters are varied. 

A model of on-off speaking patterns is not a new concept. The de- 
sign of Time Assignment Speech Interpolation* was aided by the use 
of a number of one-way (that is, single speaker) models in parallel to 
simulate speech from many subscribers. 1 Jaffe, and others, have pro- 
posed a simple two-way Markovian model intended to study the 
speech behavior of psychiatric patients. 2 H. W. Gustafson of Bell 
Telephone Laboratories has proposed some improvements on the 
Markovian model to allow better prediction of speaker alternations. 3
The author has twice suggested a model; the first, with Mrs. N. W. 
Shrimpton (unpublished work) , suggested a simple exponential fit to 
basic events such as talkspurts, and the second used a queueing system 
of "ideas" and "utterances" to yield a more complex model for talk- 
spurts. 4 

The model proposed in this paper was developed after considering a
large body of data from experimental two-way conversations conducted
on telephone quality circuits containing no transmission delay
or other degradations (see Table II footnote and Ref. 5).† To evaluate
the model, we shall compare its simulation of the 16 conversations
with data from the real conversations.

* TASI is essentially a bank of voice-operated switches which may disconnect a
subscriber from a channel when he is not talking to permit a talking subscriber to
use the channel.

† Reference 5 describes an extensive statistical analysis of speech patterns in
16 conversations, and defines many "events," such as "talkspurt," "alternation
silence," "pause in isolation," and so on. Average and median lengths for the
events are tabulated, and cumulative distribution functions are included. The
present paper assumes prior knowledge of Ref. 5. Notice that "event" is used
to mean an interval of time, such as the interval of a talkspurt, and does not
mean the occurrence of a probabilistic phenomenon such as the arrival of a
pulse.

1.2 Relation of Model to Speech Detector 

A speech detector is a rule which transforms speech into on-off 
patterns. Speech detectors are usually designed for specific needs, 
and vary considerably in their specifications. If a model is fit to one 
speech detector's output, then the model cannot be expected to be 
valid for all other detectors; but with minor changes, it may be 
adaptable to many of them. 

The author's speech detector has previously been documented, and is
described briefly here. 5,6 An initial hardware detector, with virtually no
"pickup" and "hangover," yields a pattern of "spurts" and "gaps,"
after segmenting the speech into 5 ms intervals. All spurts ≤ 15 ms are
presumed to be noise and are rejected (for throwaway); then all gaps
≤ 200 ms are filled in, as they were probably stop consonants or other
minor breaks in continuous speech. The final on-off pattern contains, by
definition, "talkspurts" and "pauses." No talkspurt can be ≤ 15 ms;
no pause can be ≤ 200 ms. The model described here therefore generates
talkspurts ≥ 20 ms and pauses ≥ 205 ms.
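The two detector rules described above (reject short spurts, then fill short gaps) can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's hardware or program: the function names are hypothetical, the input is assumed to be a list of 0/1 values on 5 ms slots, and only gaps lying between two spurts are filled (leading and trailing silence is left alone).

```python
from itertools import groupby

SLOT_MS = 5  # width of one detector interval, as stated in the text


def runs(pattern):
    """Collapse a 0/1 slot sequence into [value, length_in_slots] runs."""
    return [[v, len(list(g))] for v, g in groupby(pattern)]


def expand(run_list):
    """Inverse of runs(): rebuild the slot sequence."""
    out = []
    for v, n in run_list:
        out.extend([v] * n)
    return out


def detect(raw, max_noise_ms=15, max_gap_ms=200):
    """Turn raw spurts and gaps into talkspurts and pauses."""
    # Step 1: spurts of 15 ms or less are presumed to be noise and rejected.
    step1 = expand(
        [[0 if (v == 1 and n * SLOT_MS <= max_noise_ms) else v, n]
         for v, n in runs(raw)]
    )
    # Step 2: gaps of 200 ms or less between two spurts are filled in.
    merged = runs(step1)
    last = len(merged) - 1
    filled = []
    for i, (v, n) in enumerate(merged):
        interior = 0 < i < last  # assumption: only gaps flanked by speech
        if v == 0 and interior and n * SLOT_MS <= max_gap_ms:
            v = 1
        filled.append([v, n])
    return expand(filled)
```

With 5 ms slots, the shortest surviving talkspurt is 4 slots (20 ms) and the shortest surviving pause is 41 slots (205 ms), which matches the limits quoted above.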

The speech patterns from a speech detector are strongly influenced by 
choice of threshold. In this study, the Ref. 5 data taken with a −40 dBm
threshold were used as a basis for the simulated conversations. 

1.3 Goals of This Paper 

The remainder of this paper is divided into two main parts. Sec- 
tion II describes the model and illustrates its empirical behavior by 
comparing its output with real conversations. The question considered 
is: "Can this model generate patterns statistically similar to those 
of a randomly selected conversation?" We do not present data on ap- 
plications such as determining differences among speakers or studying 
the behavior of a single speaker as he engages in various tasks. Future 
work is planned to investigate these problems. 

Section III is a mathematical analysis of the model's behavior. 
From this analysis, one can gain an intuitive feeling of the model be- 
havior, and acquire insight into the manner in which the two speakers 
interact. For a basic treatment of the model, however, Section III may 
be omitted. Section II assumes an elementary knowledge of probability 
theory; Section III requires some background in stochastic processes. 
II. MODEL DEFINITION AND EMPIRICAL BEHAVIOR

2.1 The Model 

2.1.1 One-Port versus Many-Port Model

Consider speakers A and B to be engaged in conversation. We shall 
model only speaker A's behavior and make no attempt to include B's 
behavior in the formulation. That is, speaker B's patterns are re- 
garded only as they appear to A. It may be that B is really talking, 
but A does not receive him because of a blocking on the transmission 
line. Or, B may be delayed, and A may be receiving B's previous 
speech when in fact B is presently silent. We shall designate our model 
as a "one-port" model, since only one port, that is, A's side, is formu- 
lated. To use the model, it could be connected to anything, such as 
another one-port model, or a one-port model connected via a trans- 
mission delay, or several one-port models as in a conference circuit. 
(It may be invalid to assume that speaker A can be modeled the same 
way in a conference as in conversation with a single other speaker, 
but the model does at least allow such a connection to be formu- 
lated.) 

In a many-port model, the entire system is modeled, with the 
drawback that a separate structure is required when special circuits 
are inserted between speakers. In addition, a one-port model leads to 
a description of each speaker, while if a many-port model is used, 
and a real conversation between A and B differs from one between 
A and C, it may not be possible to ascertain the change in speaker A's 
behavior. All we know is that the pair A-B is different from the pair 
A-C. 

2.1.2 Description of the Model 

Speaker A is either talking or silent, and he views B as either talk- 
ing or silent. As Fig. 1 shows, in the simplest case four states occur at 
A's side. A is talking in the upper (shaded) half; B is talking in the 
right half. In considering transitions from state to state, as shown by 
the arrows, we apply the restriction that the two speakers cannot 
change their states at precisely the same time. Thus, in Fig. 1, diagonal 
crossings are prohibited. 

Preliminary work with the Fig. 1 model showed that it was inadequate,
especially in predicting events surrounding double talk. A
natural extension of Fig. 1 is to expand each state into two states, the
dichotomy decided by the previous state. Figure 2 illustrates the resulting
8-state model. Consider for example "A talks, solitary": to
get to this state, either both speakers previously were silent or both
were talking.

Fig. 1 — A four-state speech pattern model for speaker A. The shaded area
indicates A is talking.

Figure 3, which is a reduction of Fig. 2, shows the model that the 
author has chosen to use. The upper left and lower right quadrants 
have been collapsed back to one state; simplicity has been gained at 
the expense of some loss of precision in modeling speech patterns. 

Allowable state transitions are indicated on Fig. 3. There is no at- 
tempt to control B's behavior; he starts and stops talking in his own 
manner. Notice that his state changes cause horizontal transitions. 

Vertical transitions are determined by A. If he is talking, he stops
when a "fall silent pulse" occurs (Gustafson's terminology), and if silent
he starts when a "start talking pulse" occurs. We call these β- and
α-pulses, respectively. These pulses are a result of Poisson processes,* so
that, for example, if A is talking and B is silent (A solitary talk state),
he stops talking in the next dt sec with probability $\beta^{A}_{sol} \cdot dt$.

For notation, the subscript on β, the fall silent parameter, describes
the present state, while the subscript on α, the start talking parameter,
denotes the event that will occur if the pulse occurs. The superscript
refers to A or B. The six values for β and α are denoted $\beta_{sol}$, $\beta_{ted}$,
$\beta_{tor}$, $\alpha_{pse}$, $\alpha_{alt}$, and $\alpha_{int}$ (see Fig. 3), in which the abbreviations mean
solitary, interrupted, interrupter, pause, alternate, and interrupt.



* Poisson processes are tutorially discussed by Cox and Smith. 7

Fig. 2 — An eight-state model in which each state of Fig. 1 is divided into two
states.


It is very important to understand the nature of the α or β parameters.
They are not probabilities. However, if any α or β is multiplied
by dt (for example, dt = 0.005 s), then α dt is the probability that A
will leave the corresponding silence state "of his own volition" during
the next dt seconds. (He may of course also be forced out of the state
by B's action.) The α dt's and β dt's are "transitional" probabilities and
do not represent the probability of being in each state. These "state"
probabilities must be solved for, and can at times be difficult to obtain;
they must consider the interaction of speaker A with his correspondent
B. This is more fully discussed in Section III.

The α's and β's have a more appealing physical interpretation than
just probability parameters. If some α = 2.5, this implies that there
is an "input stream" of α-pulses trying to drive A out of his state;
the pulses occur at random times but at an average rate of 2.5 pulses
per s, or with an average between-pulse interval of 1/2.5 s. The units
of α and β are pulses per second.

In general, none of the α's or β's is time dependent, so that the duration
of occupation of a state has no effect on the value of that state's
α or β. An exception is that when A becomes silent, all α's are zero
for 205 ms (so that only horizontal transitions can occur), after which
time they resume their model values, and when A starts to talk all
β's are zero for 20 ms. This guarantees that all silences are > 200 ms,
and talkspurts are > 15 ms. (If an α-pulse occurs at the 210th ms, a
205 ms interval has occurred for that state, and the remaining 5 ms
are assigned to the new state.)

A summary of the assumptions made in the model is: 

(i) At any instant of time, A exists in one of six possible states. 

(ii) A's talk-silence behavior is governed by Poisson processes, 
whose parameters are functions of the state A is in, but not of the 
length of time in the state (except for the previously noted minimum 
event length requirement).

Fig. 3 — The six-state model used in this study. Vertical transitions are a result
of Poisson processes at A's side. Horizontal transitions, resulting from B, are in
A's external environment and are not generated by A's model.



2452 THE BELL SYSTEM TECHNICAL JOURNAL, SEPTEMBER 1969 

(iii) The speakers cannot both change their speaking status at
precisely the same instant of time.†
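The assumptions listed above can be turned into a small slot-by-slot simulation of two one-port models connected directly. The sketch below is an illustration, not the author's simulator. Several details are assumptions: the parameter keys (`b_sol`, `a_pse`, and so on) are hypothetical names, both speakers start silent and free to talk, and when both speakers could change in the same 5 ms slot a randomly ordered check lets only the first one change.

```python
import random

DT = 0.005             # slot width in seconds
MIN_TALK_SLOTS = 4     # beta's are zero for the first 20 ms of talking
MIN_SILENT_SLOTS = 41  # alpha's are zero for the first 205 ms of silence


def exit_rate(i, talking, last_solo, params):
    """Alpha or beta (pulses per second) for speaker i's current state."""
    p, other = params[i], 1 - i
    if talking[i]:
        if not talking[other]:
            return p["b_sol"]                      # solitary talk
        # Double talk: whoever was talking alone before it is "interrupted".
        return p["b_ted"] if last_solo == i else p["b_tor"]
    if talking[other]:
        return p["a_int"]                          # listening; start = interrupt
    # Mutual silence: "pause" if this speaker talked last, else "alternate".
    return p["a_pse"] if last_solo == i else p["a_alt"]


def simulate(params, n_slots, seed=0):
    """Return two 0/1 slot patterns, one per speaker."""
    rng = random.Random(seed)
    talking = [False, False]
    age = [MIN_SILENT_SLOTS, MIN_SILENT_SLOTS]  # slots spent in current status
    last_solo = 0                               # assumed starting history
    pattern = ([], [])
    for _ in range(n_slots):
        order = [0, 1]
        rng.shuffle(order)
        for i in order:
            hold = MIN_TALK_SLOTS if talking[i] else MIN_SILENT_SLOTS
            rate = exit_rate(i, talking, last_solo, params)
            if age[i] >= hold and rng.random() < rate * DT:
                talking[i] = not talking[i]
                age[i] = 0
                break  # both speakers cannot change in the same slot
        if talking[0] != talking[1]:
            last_solo = 0 if talking[0] else 1
        for i in (0, 1):
            pattern[i].append(int(talking[i]))
            age[i] += 1
    return pattern
```

A call such as `simulate([p, p], 240000)` with a hypothetical parameter dictionary `p` produces 20 minutes of patterns. Keeping only `last_solo` is enough to split double talk and mutual silence into their two states, because diagonal transitions are prohibited and both are therefore always entered from a state in which exactly one speaker is talking.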

2.2 Extracting the Model Parameters 

The six parameters for each of 32 speakers engaged in 16 conversations 
were derived in a very simple way: transition frequencies from state to 
state of the Fig. 1 model were counted by a brute-force stepping through 
each conversation. 

To illustrate the process, recall that each person's speech is coded into
5 ms on-off intervals. Say that speaker A is in the solitary talk state
(state 1, Fig. 3). $\beta^{A}_{sol}$ can be found from a frequency count of A's falling
silent from the state. Thus,

$$\beta^{A}_{sol} \cdot dt = \beta^{A}_{sol} \cdot (0.005) = \frac{\text{number of times A falls silent from state}}{\text{number of times A is in state, including numerator of this fraction}}. \tag{1}$$

Whenever A is in a state, his behavior can be regarded as a succession of
Bernoulli trials, in which case the above ratio is a best unbiased estimator
for $\beta^{A}_{sol} \cdot (0.005)$, and hence of $\beta^{A}_{sol}$.

There are certain "trials" or 5 ms intervals which are not included 
in the frequency count. If A just begins to talk, he cannot leave the 
state until the talkspurt > 20 ms, or there are four intervals of 0.005 
s; therefore, the first four intervals are not included. In silences, the 
first 41 intervals (205 ms) are not included. Also, if B's behavior produces
a horizontal transition, this interval is not included, although
the intervals up to that one are counted. The rare intervals contain- 
ing both a horizontal and vertical transition are counted as vertical 
transitions. Table I is a list of the α and β values for all 32 speakers
in 16 conversations. 
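The counting procedure of this section can be sketched as follows. This is a reading of the rules in the text rather than the author's program. The function names and state labels are hypothetical, the patterns are assumed to be 0/1 lists on 5 ms slots, and the history before the first slot is unknown, so the sketch arbitrarily starts with "A talked last" and excludes the opening slots of the first run like those of any other run.

```python
DT = 0.005
HOLD = {1: 4, 0: 41}  # excluded slots at the start of a talkspurt / silence


def state_of_a(a_on, b_on, last_solo):
    """Label A's state using the six abbreviations of the model."""
    if a_on:
        if not b_on:
            return "sol"
        return "ted" if last_solo == "A" else "tor"
    if b_on:
        return "int"
    return "pse" if last_solo == "A" else "alt"


def estimate_rates(a, b):
    """Estimate A's six parameters (pulses per second) from slot patterns."""
    trials = dict.fromkeys(("sol", "ted", "tor", "pse", "alt", "int"), 0)
    exits = dict(trials)
    last_solo = "A"  # assumption about the unknown history
    age = 0          # slots A has spent in its current status
    for t in range(1, len(a)):
        prev_a, prev_b = a[t - 1], b[t - 1]
        if prev_a != prev_b:
            last_solo = "A" if prev_a else "B"
        age = age + 1 if (t == 1 or a[t - 1] == a[t - 2]) else 1
        if age < HOLD[prev_a]:
            continue  # minimum-length intervals are not trials
        a_changed = a[t] != prev_a
        b_changed = b[t] != prev_b
        if b_changed and not a_changed:
            continue  # horizontal transition: interval not counted
        s = state_of_a(prev_a, prev_b, last_solo)
        trials[s] += 1
        exits[s] += a_changed  # simultaneous changes count as vertical
    return {s: (exits[s] / trials[s]) / DT if trials[s] else None
            for s in trials}
```

Each returned value is the ratio of exits to trials divided by dt, which is the estimator described above; states that never occur return `None` rather than a rate.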

2.3 Testing the Model 

2.3.1 Method of Testing 

To investigate the behavior of the model, a Monte Carlo simulator
generated a model conversation of any desired length in a form
which could be analyzed by Mrs. N. W. Shrimpton's speech analysis
program. (The output of the program is illustrated in Ref. 5 and some
of it is shown here.) The general procedure was to extract parameters
from a real conversation and then simulate a conversation of 20 minutes
duration. The original conversations were between 7 and 10 minutes
long, but the simulated ones were longer to better estimate the
true theoretical behavior. (Economic considerations prohibited simulations
significantly longer than 20 minutes.)

† In the author's simulation, they cannot both change status in the same 5 ms
time slot. This does occur in the author's data from the speech detectors, but it is
very rare.

If we could regard the two conversations of a real-simulated con- 
versation pair as independent samples from two populations (or the 
same population), then classical statistical tests (such as t-test on 
means) would be appropriate. Unfortunately, the simulated conversa- 
tions were derived from measurements of the real conversations, and 
standard tests no longer apply. For example, say that the talkspurt 
average lengths were very close for real and simulated conversations.
With independent samples, this would suggest a good fit, but it may 
be that we have forced a good fit by setting simulated parameters 
equal to measured parameters of real speech. 

Instead of using statistical tests, we define a "fit parameter," or 
FP, to indicate the correspondence between real and simulated events. 
This correspondence is examined for three quantities: average lengths 
of the events, cumulative distribution functions (cdfs) of the events, 
and rate of occurrence (for example, number of talkspurts per second) . 
These three quantities are not independent; for example, a good fit of 
the cumulative distribution function (cdf) implies a good fit to the 
average (but the converse is not true) . In assessing a good or bad fit 
of the model to the speech data, the fit parameters are not treated as 
yielding three independent pieces of information, but rather as rep- 
resenting three viewpoints of the goodness of a fit problem. 

Table II is a list of the ten events. Two events, double talks (3) 
and mutual silences (4) , merit a brief comment. In the experimental 
conversations, with no circuit degradation or delay, these events are 
identical for both speakers. However, for consistency with the other 
events, comparisons of the fit parameter are made twice, once for 
each speaker. This causes some redundancy in the tabulated com- 
parisons of events 3 and 4 in Tables III through V. 

2.3.2 Average Lengths 

For average lengths, we define

$$FP(\bar{X}) \text{ (fit parameter)} = \frac{\bar{X}_{real} - \bar{X}_{sim}}{\left( \dfrac{s^{2}_{real}}{n_{real}} + \dfrac{s^{2}_{sim}}{n_{sim}} \right)^{1/2}}, \tag{2}$$
Table II — Categorized Speech Events

Number   Event*
  1      Talkspurt
  2      Pause
  3      Double talk
  4      Mutual silence
  5      Alternation silence†
  6      Pause in isolation
  7      Solitary talkspurt
  8      Interruption
  9      Speech after interruption
 10      Speech before interruption

* For definition of "event" see Ref. 5.

† In Fig. 6 of Ref. 5, the alternation silences illustrated in the sample patterns
are all incorrectly labeled. The A's and B's are transposed.

that is, the normalized difference between the means. If the observations
were independent and from the same population, $FP(\bar{X})$ would be
normal, μ = 0, σ² = 1. Independence is violated here, but we still can
regard FP as an indication of similarity, and arbitrarily regard the fit
as "bad" if $|FP(\bar{X})| > 1.96$. Table III lists $FP(\bar{X})$ for 10 events, 32
speakers. (See Ref. 5 for tabulated average lengths of real speech events.)
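A minimal sketch of the fit parameter for average lengths follows. It assumes that the "normalized difference between the means" is the difference divided by the usual two-sample standard error built from the sample variances; the function name `fp_mean` is hypothetical.

```python
import math
from statistics import mean, variance


def fp_mean(real, sim):
    """Normalized difference between mean event lengths (real minus simulated)."""
    # Assumed normalization: standard error of the difference of two means.
    se = math.sqrt(variance(real) / len(real) + variance(sim) / len(sim))
    return (mean(real) - mean(sim)) / se


def is_bad_fit(real, sim, limit=1.96):
    """Apply the arbitrary |FP| > 1.96 rule used in the text."""
    return abs(fp_mean(real, sim)) > limit
```

A positive value means that the real events are longer on average than the simulated ones. Each sample needs at least two observations for the variance to be defined.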

2.3.3 Cumulative Distribution Functions 

In a two-sample Kolmogorov-Smirnov test, in which $n_1$ observations
are made for one sample and $n_2$ for the other, the test statistic D is
the maximum vertical discrepancy (absolute value) between the cumulative
distribution functions for the two samples. If both $n_1$ and $n_2$
exceed 40, the identical population hypothesis is rejected at 0.05 level
(see p. 131 of Ref. 8) if

$$D > 1.36 \left( \frac{n_1 + n_2}{n_1 n_2} \right)^{1/2}. \tag{3}$$

Again, in our data the samples are not independent, and although
$n_{sim}$ almost always exceeds 40, $n_{real}$ often does not exceed 40 for those
events surrounding interruptions. Nevertheless, we define

$$FP_{cdf} = \frac{D}{1.36 \left( \dfrac{n_{real} + n_{sim}}{n_{real}\, n_{sim}} \right)^{1/2}}. \tag{4}$$

If $FP_{cdf} > 1$, the fit will be considered bad. Table IV lists $FP_{cdf}$ for
10 events and 32 speakers.
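The cdf fit parameter can be sketched directly from its definition: the Kolmogorov-Smirnov statistic D divided by the 0.05-level critical value. The function names are hypothetical, and the inputs are assumed to be plain lists of event lengths rather than the binned category intervals used for the plots.

```python
import math
from bisect import bisect_right


def ks_statistic(x, y):
    """Maximum vertical gap between the two empirical cdfs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # Both cdfs are right-continuous step functions, so the largest gap
    # is attained at one of the observed values.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def fp_cdf(real, sim):
    """D scaled by the large-sample 0.05 critical value; above 1 is a bad fit."""
    n_r, n_s = len(real), len(sim)
    critical = 1.36 * math.sqrt((n_r + n_s) / (n_r * n_s))
    return ks_statistic(real, sim) / critical
```

For two samples of 50 events each the critical value is 1.36 × 0.2 = 0.272, so any D above 0.272 would be labeled a bad fit.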

Comparative plots of $cdf_{real}$ versus $cdf_{sim}$ for all 10 events and all
32 speakers were generated. The curves for events 1 and 10 for speaker
2 of conversation 12 were arbitrarily selected for inclusion in this paper
as Figs. 4 and 5. They illustrate a good and bad fit of the cumulative
distribution functions, respectively. The plotted points are not data
points; they represent category intervals of 15, 20, 30, ..., 200 ms, and
1, 2, 3 s, and so on. Thus, the number of asterisks in the cumulative
distribution functions of real speech, or of breakpoints in the connected
curves of cumulative distribution functions of simulated speech,
do not equal $n_{real}$ and $n_{sim}$, respectively.

2.3.4 Rate of Occurrence 

To compare rate of occurrence of events

$$FP_n = \frac{n_{sim} / \text{length of sim conversation}}{n_{real} / \text{length of real conversation}}. \tag{5}$$

For a good fit, $FP_n$ should be close to 1.0. When either n is small,
$FP_n$ may be changed considerably by the addition or subtraction of
even one event, and unfortunately, $FP_n$ does not consider the absolute
values of the n's in the comparison, as do the other two FP's. In addition,
we have not found a statistical test which is suitable for comparing
rates of occurrences of events such as our speech events. Table
V therefore is included only as a listing of the values of $FP_n$ for eight
events, 32 speakers without specifying good or bad fits. (Events 8, 9,
and 10 occur an equal number of times; thus $FP_n$ is equal for the
three events.)
Fig. 4 — Real and simulated talkspurt distributions for speaker 2, conversation
12, illustrating a good fit. Circles are not data points; they occur at arbitrary
category intervals. Circles represent real speech; connected curve, simulated
patterns.

Fig. 5 — Real and simulated speech before interruption distributions, speaker 2,
conversation 12, illustrating a bad fit.



2.4 Discussion 

2.4.1 Goodness of Fit 

In this study, the model is regarded as successful if it can match the 
distributions and rates of occurrence of the ten events listed in Table II. 
To see how well this criterion is met, the three FP's are considered 
separately. 

First, notice in Table III that several columns (events) have no "bad" 
fits for average lengths. Now, in 32 trials of a legitimate 0.05 level test, 
we would expect about 1.5 failures; the six columns with no failures 
represent 160 trials (192 trials minus the 32 redundant trials for events 
3 and 4) with eight expected failures. The lack of failures tends to rule 
out the N(0, 1) distribution of entries in a column. Further, inspection
of column 2, for example, reveals a variable which is apparently not
N(0, 1); 28 out of 32 (87.5 percent) of observations are within ±1, as
opposed to 68 percent for N(0, 1). This substantiates our earlier statement
that use of the FP does not constitute a legitimate statistical test.

Without regard to statistics, however, the Table III data do show a 
"good" correspondence between real and simulated (that is, model 
predicted) averages for all events except 10 (speech before interruption) 
and possibly 7 (solitary talkspurts). Looking now at Table IV, the model
is clearly inadequate for event 10, but event 7 is not much worse than the
rest. The Kolmogorov-Smirnov test (using $FP_{cdf} \geq 1$ as failure criterion)
is powerful and would be a severe test if statistical tests were valid; for
this reason, we regard the cumulative distribution function fits as
generally successful (except for event 10), but with room for improvement.
Further, some fits are remarkably close, such as the talkspurt
cumulative distribution functions for conversation 12 speaker 2 (see
Fig. 4).

The rate of occurrence ratios in Table V are generally close to 1.0; 
there are a few scattered discrepancies but the model does not appear 
to have serious problems in generating a realistic number of events. 

Regarding individual conversations, a study of Tables III and IV 
shows no tendency for the model to fail on particular speakers. The 
speakers in conversation 3 have a few more failures than seems nor- 
mal, but even here the model exhibits "acceptable" fits for most events. 
The model appears equally valid for men (conversations 5 through 12) 
as for women (1 through 4 and 13 through 16) . 

2.4.2 Conversational Behavior 

The failure of the model in predicting speech-before-interruption 
intervals may shed some light on the behavior of the subjects. Table 
III shows that the simulated intervals are too long; real people tend 
to interrupt sooner than predicted by simulation. This may be a 
question of reaction time. The model assumes that the instant A (who 
is silent) hears B begin to speak, A is immediately in a "listen to B"
state and will speak only if he wishes to interrupt. In reality, A may
require some time (perhaps 200 ms) before he adjusts to the presence
of B's speech; in the meantime A's speech may not be intended as an
interruption. A more sophisticated model might in fact assume the 
existence of a short delay in A's reception of B. 

The numerical values of the α's and β's (Table I) also provide clues
to behavior. The absolute values are a little hard to interpret since
they are so closely related to the design of the author's speech detector.
But notice that $\alpha_{alt}$ is less than $\alpha_{pse}$ for each speaker, confirming
our intuitive belief that a person is more likely to resume talking after
a pause he generates than after a pause the other party generates.
This also justifies having two different states for A in which B is
silent; the model would certainly deteriorate if the states were merged
to one with an "averaged" α parameter.

Considering $\beta_{tor}$ and $\beta_{ted}$, there is no consistent difference;
$\beta_{ted} > \beta_{tor}$ for 15 of the 32 speakers. Thus, 15 (about half) of the subjects
are more likely to terminate double talking if interrupted than if they
are interruptors. A simpler model might merge these states, but serious
errors might result for some subjects, since $\beta_{ted}$ is often considerably
different from $\beta_{tor}$.

The lowest of the α's is predictably $\alpha_{int}$. A person is less likely to
start talking when his correspondent is talking than when he is silent.




The precision of the α's and β's merits some attention. Because these
quantities were measured over a person's entire conversation, they are
not statistical estimators, but exact measures, correct to six figures. If
you wish to regard the conversation as a sample of a larger population,
however, you could regard an estimated value of α or β as a measure of
a population α or β and establish confidence limits. The α's and β's were
measured from Bernoulli trials where n varied from about 1000 to 40,000,
depending on the conversation and parameter to be measured. Although
the n's were large, the p values (α dt) were generally very small, typically
about 0.005; standard deviations of α or β estimates could equal about
0.1, with resulting 95 percent confidence limits of about ±0.2.
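The order of magnitude quoted above can be checked with the binomial standard error. The sketch below assumes the normal approximation to the binomial and a hypothetical count of 100 exits in 20,000 trials (p = 0.005, with n chosen inside the 1000 to 40,000 range); the function name is also hypothetical.

```python
import math

DT = 0.005  # seconds per Bernoulli trial


def rate_confidence(exits, trials, z=1.96):
    """Return (rate, standard deviation, 95 percent half-width) in pulses/s."""
    p = exits / trials                        # per-slot exit probability
    se_p = math.sqrt(p * (1 - p) / trials)    # binomial standard error of p
    se_rate = se_p / DT                       # rescale from probability to rate
    return p / DT, se_rate, z * se_rate


rate, sd, half_width = rate_confidence(100, 20000)
print(rate, round(sd, 4), round(half_width, 4))
```

Under these assumptions the rate is 1.0 pulse per second with a standard deviation near 0.1 and limits near ±0.2. Smaller n widens the limits: the same p with n = 1000 gives a standard deviation of roughly 0.45.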

2.4.3 Scope of the Model 

Telephone conversations usually begin with a brief but rapid inter- 
change of short words ("hello," and so on) . In many calls the calling 
party then assumes dominance, and then possibly the other party 
may dominate. Our model attempts to duplicate speech patterns using 
six time-invariant parameters for each speaker and cannot, except
by chance, generate the alternation of dominance which often occurs 
in real conversations. 

The model is, however, a very simple one. With only six states we 
are attempting to simulate the utterance patterns of a person, who is 
certainly not a six-state device. Simplicity is also achieved by the 
Markovian technique of having a person leave a state with a time- 
invariant probability, independent of the duration of state occupa- 
tion. (The minimum pause and talkspurt lengths constitute minor 
violations of this philosophy, but add little to the complexity of the 
model.) 

The real issue here is not whether such a simple model can duplicate 
all aspects of conversation behavior, but rather whether such a model 
is useful on its own. The author plans to test it by using it to investi- 
gate speech behavior on circuits with transmission delay; another 
group at Bell Laboratories is studying its applicability to circuits 
with switched-gain amplifiers. The ease with which the model can 
be simulated, plus its success in matching overall patterns, gives it 
the potential of becoming an important tool in the study of conversa- 
tional dynamics. 

It may eventually prove worthwhile to extend the model and try 
to get a closer match to the dynamics of conversation. One way to do 
this would be to increase the number of states. This might improve 
the fit to the "total pattern" distribution, but might require a huge
number of states before a realistic "dominance alternation" occurs.
Another way would be to introduce time-varying α and β parameters
in the present six-state model. It appears that the development of 
either of these extended models (or a combination of them) would 
require an intensive amount of additional research. 

III. MATHEMATICAL ANALYSIS

The principal goal of this section is to find theoretical distribution 
functions of the ten speech events in Table II. A complete analysis 
of the Fig. 3 model is not possible, but it is possible to analyze a 
simplified model and extend the results. For analysis, the model must 
be connected to another speaker. Section 3.1 considers speakers A and 
B to be directly connected with no minimum pause and talkspurt 
restrictions. Section 3.2 introduces these restrictions, to make the 
model match the author's speech detector. Section 3.3 considers an 
exponential approximation to talkspurts and pauses, and Section 3.4 
discusses the effects on the analysis of introducing special circuits be- 
tween subjects (transmission delay, echo suppressors). 

3.1 Direct Connection of Two Speakers

Let the speech pattern model (Fig. 3) for speaker A be directly 
connected with one for speaker B. The entire A-B system thus exists 
in six states, since each state for A can be shown to correspond to a 
unique state for B. If all α's are forced to zero in the first 200 ms of
silence, then a 200 ms minimum pause restriction is achieved; if β's are
zero for 15 ms of talking, a 15 ms minimum talkspurt is achieved. In
this section, we do not use these restrictions; we regard all α's and β's
as time invariant. Because each state is terminated by a Poisson pulse 
from either A or B, the entire system is Markovian and the duration of 
each state has an exponential distribution. 

This is illustrated, for example, by state 5. A will leave state 5 of his
own volition in dt seconds with probability $\alpha^{A}_{alt} \cdot dt$. State 5 for A corresponds
to state 4 for B; hence, B causes A to leave state 5 with probability
$\alpha^{B}_{pse} \cdot dt$. A remains in state 5 with probability $1 - \alpha^{A}_{alt} \cdot dt - \alpha^{B}_{pse} \cdot dt$.*
State 5 is thus terminated by a Poisson process with parameter
$(\alpha^{A}_{alt} + \alpha^{B}_{pse})$; its duration is exponentially distributed with that parameter
(see p. 154 of Ref. 9).

The appendix shows that even if only those events are considered in 
which A happens to terminate a state, these events are also exponential 



* Cox and Smith give an expository treatment of this kind of analysis. 7 
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with the parameter equal to the sum of the A and B "exit" parameters. 
For example, a "solitary talkspurt," in which A generates a talkspurt 
entirely within B's silence, is terminated when A leaves state 1 because 
of a P? ol - pulse. Nevertheless, A's solitary talkspurt is exponential 
with parameter (/3? , + af nt ) and therefore has an average length of 
V(/3f j + a fni)- (State 1 at A's side corresponds to state 6 at B's side.) 
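The appendix result can be checked numerically. The following Python sketch is an illustration only, not the author's simulator: it draws competing exit pulses for A and B, keeps only the visits that A terminates, and compares their average length with the reciprocal of the summed rates. The rate values and the function name are hypothetical and are not taken from Table I.

```python
import random

def a_terminated_durations(beta_a, alpha_b, n_visits, seed=1):
    """Simulate visits to state 1 and keep only those that A terminates.

    beta_a  : rate (1/s) of A's exit pulses (hypothetical value)
    alpha_b : rate (1/s) of B's exit pulses (hypothetical value)
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_visits):
        t_a = rng.expovariate(beta_a)    # first exit pulse from A
        t_b = rng.expovariate(alpha_b)   # first exit pulse from B
        if t_a < t_b:                    # A's pulse ends the state
            kept.append(t_a)
    return kept

BETA_A, ALPHA_B, N = 0.8, 0.4, 200_000   # illustrative, not from Table I
durations = a_terminated_durations(BETA_A, ALPHA_B, N)
mean_length = sum(durations) / len(durations)
fraction_by_a = len(durations) / N

print("mean A-terminated length:", round(mean_length, 3),
      "theory:", round(1 / (BETA_A + ALPHA_B), 3))
print("fraction ended by A:", round(fraction_by_a, 3),
      "theory:", round(BETA_A / (BETA_A + ALPHA_B), 3))
```

The average length of the A-terminated visits should fall close to 1/(BETA_A + ALPHA_B) rather than 1/BETA_A, which is the point of the appendix: B's exit parameter shortens A's solitary talkspurts even when A ends them.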

This prediction for solitary talkspurt average lengths is well sup- 
ported by simulation and is in fair agreement with actual speech data. 
Table VI compares the predicted average talkspurt lengths for 32 
speakers with the measured averages from simulation. Only 2 out of 
32 fail a 5 percent level test, which indicates that the simulator (that 
is, model) behaves as predicted. 
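The 5 percent level test can be sketched as follows. This is a minimal illustration under the normal approximation described in the note to Table VI (standard deviation equal to the predicted mean divided by the square root of n); the two-sided critical value 1.96 is an assumption, since the text does not say whether the test is one- or two-sided. The function name is hypothetical.

```python
import math

def fails_level_test(predicted, n, observed, z_crit=1.96):
    """Return (fails, z) for one speaker.

    The sample average of n exponential lengths is treated as normal with
    mean `predicted` and standard deviation predicted / sqrt(n).
    z_crit = 1.96 assumes a two-sided test at the 0.05 level.
    """
    sigma = predicted / math.sqrt(n)
    z = (observed - predicted) / sigma
    return abs(z) > z_crit, z

# Two rows of Table VI: (predicted, simulated n, simulated average)
rows = {"conv. 1, speaker 1": (0.541, 342, 0.556),
        "conv. 7, speaker 1": (0.739, 363, 0.832)}
for label, (pred, n, ave) in rows.items():
    fails, z = fails_level_test(pred, n, ave)
    print(label, "z =", round(z, 2), "fails" if fails else "passes")
```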

Table VI also shows data from real speech. It is more appropriate 
to compare the real speech averages with simulated averages than 
with theoretical predictions, since the simulator contained the 15 ms 
and 200 ms minimum talkspurt and pause restrictions. Table III 
showed that 6 of the 32 average lengths of simulated solitary talk- 
spurts were judged to be "bad" fits to empirical averages. In addition, 
a product-moment correlation of 0.91 is found for the two columns of 
average lengths in Table VI. A reasonably good fit is thus suggested; 
but Table VI shows that the real speech average exceeds the simulated 
average in 25 of the 32 cases. There is therefore a definite but mild 
tendency for the model (that is, simulator) to predict solitary talk- 
spurts which are too short. This in no way refutes the result of the 
appendix, which is related only to the theoretical model. 

In summary, the six-state Markovian system in this section may 
be solved by standard techniques. The following conclusions seem 
most relevant to speech analysis. 

(i) A solution of the steady state probabilities of being in each of six 
states (that is, percent time in each state) may be obtained by routine 
solution of Markovian transition equations. This solution is not pre- 
sented here because it is cumbersome, and it is not required for finding 
the distributions of durations of many of the states. 

(ii) The distribution of the duration of A's being in any one of the
six states is exponential with its parameter equal to the sum of the A 
and B parameters for leaving the state. 

(iii) The distribution of three speech events may be immediately
deduced. The events are: 

(a) Alternation silence from B to A, in which B stops talking, there 
is a mutual silence, and A starts. This is distributed as the duration of 
state 5: exponential ($\alpha^A_{\mathrm{alt}} + \alpha^B_{\mathrm{pau}}$).

Table VI: Predicted and Empirical Average Solitary Talkspurt Lengths

                                          Simulated          Real Speech
Conversation  Speaker  Predicted (s)      n    Ave (s)       n    Ave (s)
      1          1         0.541        342    0.556       158    0.655
                 2         0.641        246    0.628       149    0.606
      2          1         0.971        377    0.983       183    1.035
                 2         1.070        149    0.995        85    1.343
      3          1         0.846        213    0.773       116    1.172
                 2         0.592        380    0.566       166    0.621
      4          1         0.906        374    0.969       208    0.966
                 2         0.832        231    0.834       142    0.952
      5          1         1.476        122    2.225*       87    1.631
                 2         2.968         73    3.282        59    3.045
      6          1         0.850         79    0.729        38    1.135
                 2         0.987        175    0.957       106    1.259
      7          1         0.739        363    0.832*      181    0.796
                 2         0.936        320    0.942       156    1.064
      8          1         1.072        222    1.114       117    1.192
                 2         1.251        333    1.160       155    1.380
      9          1         1.010        2S0    1.013        82    1.203
                 2         1.123        231    1.025        80    1.242
     10          1         1.079        356    1.071       116    1.157
                 2         1.268        226    1.271        71    1.234
     11          1         1.390        144    1.290        51    1.879
                 2         1.267         82    1.214        23    1.467
     12          1         1.237        226    1.130        77    1.405
                 2         0.973        172    1.027        66    0.948
     13          1         0.945        214    0.880        76    1.087
                 2         1.032        282    1.012        97    1.238
     14          1         1.130        336    1.121       118    1.221
                 2         0.880        244    0.860        86    1.047
     15          1         0.645        203    0.629        56    0.684
                 2         0.932        388    0.893       124    0.992
     16          1         1.004        116    0.958        18    1.335
                 2         0.554        864    0.549       129    0.549

Predicted averages for A speakers (speakers 1) = $1/(\beta^A_{\mathrm{sol}} + \alpha^B_{\mathrm{int}})$; for B
speakers = $1/(\beta^B_{\mathrm{sol}} + \alpha^A_{\mathrm{int}})$. Values for $\alpha$'s and $\beta$'s were obtained from Table I.
This prediction is slightly in error because of the 200 ms minimum pause requirement,
as explained in Section 3.2. Significance (marked by asterisks) is at the 0.05 level; $\bar{x}$ is
assumed normal with mean = predicted average, $\sigma = \mathrm{mean}/\sqrt{n}$, since for a single
observation from an exponential distribution, $\sigma = \mu$. (Simulated and real speech n's
are considerably different because lengths of conversations are different.) Product-
moment correlation of simulated and real averages = 0.91.



(b) Pause in isolation, which has the distribution of state 4:
exponential ($\alpha^A_{\mathrm{pau}} + \alpha^B_{\mathrm{alt}}$).

(c) Solitary talkspurt, which is exponential with parameter
($\beta^A_{\mathrm{sol}} + \alpha^B_{\mathrm{int}}$). (State 1 also has this distribution; but A's being in state 1
does not imply a solitary talkspurt, since state 1 can be entered from
double talking.)

(iv) Two distributions are a little more difficult, but straightforward.

(a) Double talk, in which states 2 and 3 are each exponential, but
with different parameters. The double talk density function is an average
of the two exponential density functions, each weighted by the steady
state probabilities of the states 2 and 3, respectively. The resulting
distribution probably resembles an exponential, but is in general not
strictly exponential unless states 2 and 3 are identically distributed.*

(b) Mutual silence, which is the same as in the case of double talk,
but with states 4 and 5, which are each exponential.

(v) The distributions of the remaining events of Table II are very
difficult to derive. For example, we notice that a talkspurt can consist
of any of infinitely many possible sequences of states 1, 2, and 3.†
Although there are techniques for handling problems of this type, they
are complicated and in this case may yield formidable analytic expressions.
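The "routine solution" of conclusion (i) can be sketched numerically. The Python fragment below is a hedged illustration, not the author's method: the transition structure is an assumption inferred from the state descriptions in the text (Fig. 3 is not reproduced in this section), the parameter keys "itd" and "itr" are placeholder names for the two double-talk exit rates, and all rate values are hypothetical. It finds the fraction of time in each state by repeatedly applying a uniformized one-step transition matrix.

```python
def rate_matrix(a, b):
    """6x6 transition-rate matrix of the joint system, indexed by A's state.

    a, b : dicts of each speaker's six exit rates (1/s). The transitions
    listed here are assumed from the state descriptions, not copied from Fig. 3.
    """
    q = [[0.0] * 6 for _ in range(6)]

    def add(src, dst, rate):
        q[src - 1][dst - 1] += rate
        q[src - 1][src - 1] -= rate

    # A's exit               B's exit
    add(1, 4, a["sol"]);     add(1, 2, b["int"])   # 1: A talks alone
    add(2, 6, a["itd"]);     add(2, 1, b["itr"])   # 2: double talk, A interrupted
    add(3, 6, a["itr"]);     add(3, 1, b["itd"])   # 3: double talk, A interrupts
    add(4, 1, a["pau"]);     add(4, 6, b["alt"])   # 4: mutual silence, A spoke last
    add(5, 1, a["alt"]);     add(5, 6, b["pau"])   # 5: mutual silence, B spoke last
    add(6, 3, a["int"]);     add(6, 5, b["sol"])   # 6: B talks alone
    return q


def steady_state(q, sweeps=5000):
    """Fraction of time in each state, via the matrix P = I + Q / rate."""
    n = len(q)
    rate = 1.1 * max(-q[i][i] for i in range(n))
    p = [[(1.0 if i == j else 0.0) + q[i][j] / rate for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(sweeps):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical rates, identical for both speakers.
speaker = {"sol": 1.0, "itd": 4.0, "itr": 3.0, "pau": 2.0, "alt": 1.5, "int": 0.2}
Q = rate_matrix(speaker, speaker)
pi = steady_state(Q)
for state, prob in enumerate(pi, start=1):
    print("state", state, "percent time:", round(100 * prob, 2))
```

The diagonal of Q also illustrates conclusion (ii): the exit rate of each state is the sum of the A and B parameters for leaving it.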

Notice that for this completely Markovian system of the ten speech 
events of Table II, only three — alternation silence, pause in isolation, 
and solitary talkspurt — are strictly exponentially distributed. But all 
events consist of concatenations of the six states, which in turn are 
exponential. We could think of these states as exponential "building 
blocks" with which the speech events are constructed. 
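To see why a weighted average of two exponential densities is not itself exponential, one can test the memoryless property. For a pure exponential the survival function S satisfies S(2t) = S(t)^2 for every t. The sketch below, with hypothetical weights and rates, evaluates the gap S(2t) - S(t)^2 for a two-component mixture such as the one described for double talk.

```python
import math

def mixture_survival(t, w, rate_2, rate_3):
    """P(duration > t) when the density is w*Exp(rate_2) + (1 - w)*Exp(rate_3)."""
    return w * math.exp(-rate_2 * t) + (1.0 - w) * math.exp(-rate_3 * t)

def memoryless_gap(w, rate_2, rate_3, t=0.5):
    """S(2t) - S(t)**2; zero for every t only for a pure exponential."""
    s_t = mixture_survival(t, w, rate_2, rate_3)
    s_2t = mixture_survival(2.0 * t, w, rate_2, rate_3)
    return s_2t - s_t ** 2

# Hypothetical weight and rates (1/s) for states 2 and 3.
print("different rates:", memoryless_gap(0.5, 2.0, 5.0))
print("identical rates:", memoryless_gap(0.5, 3.0, 3.0))
```

Algebraically the gap equals w(1 - w)(a - b)^2 with a = exp(-rate_2 t) and b = exp(-rate_3 t), so it vanishes only when the two states are identically distributed or one weight is zero.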

* Averaging two density functions is not equivalent to averaging two
independent exponential random variables. The latter operation yields a
gamma distribution.

† Certain sequences are not possible, such as 1, 3, 2; but there are still
infinitely many allowable ways A can wander among the three states before
falling silent.

3.2 Effect of Minimum Pause and Talkspurt Length

The introduction of time-varying parameters to obtain minimum
lengths for pauses and talkspurts destroys the Markovian structure of the
model, and standard techniques are not applicable for solution. However,
certain results are still obtainable.

First of all, in the speech model, the 15 ms minimum talkspurt
requirement is included because the author's speech detector, used to
collect the speech data to test the model, uses a 15 ms throwaway for
noise rejection in the raw speech data; hence all measured talkspurts
exceed 15 ms. Even without the 15 ms restriction in the model, most
simulated speech events are much longer than 15 ms, and we can
anticipate only minor errors by ignoring the minimum length in the
analysis.

The 0.2 s minimum pause is harder to deal with; it is long enough to
affect the results. In general, constant lengths of 0.2 s are added to
exponentially distributed silent state durations. We refer to the resulting
distribution as constant-plus-exponential.

In some cases, one of the state exit parameters (say $\beta$) may be zero
for 0.2 s, while the other ($\alpha$) remains at its usual value. Then, the first
0.2 s is exponential with parameter $\alpha$, and the probability that the
interval will extend beyond 0.2 s is $e^{-0.2\alpha}$. Intervals beyond 0.2 s are
constant-plus-exponential distributed, with constant = 0.2 s and
exponential parameter = ($\alpha + \beta$). Since the relative fraction of less-than
versus greater-than 0.2 s intervals is known, the total distribution can be
found by combining the pre- and post-0.2-s exponential segments.
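The combination of the two segments can be written as a single survival function. The following sketch assumes, as in the paragraph above, that $\beta$ is held at zero for the first 0.2 s while $\alpha$ is constant throughout; the function names and numerical values are hypothetical.

```python
import math

def survival(t, alpha, beta, hold=0.2):
    """P(interval > t) when beta is zero for the first `hold` seconds.

    Before `hold` only alpha acts; afterwards both alpha and beta act.
    """
    if t <= hold:
        return math.exp(-alpha * t)
    return math.exp(-alpha * hold) * math.exp(-(alpha + beta) * (t - hold))

def cdf(t, alpha, beta, hold=0.2):
    """Cumulative distribution of the interval length."""
    return 1.0 - survival(t, alpha, beta, hold)

# Hypothetical rates in 1/s.
ALPHA, BETA = 1.5, 2.0
for t in (0.1, 0.2, 0.5, 1.0):
    print("t =", t, "P(interval <= t) =", round(cdf(t, ALPHA, BETA), 4))
```

The factor exp(-alpha * hold) is the fraction of intervals that survive past 0.2 s, which is what allows the two segments to be joined into one distribution.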

These results can be used to draw the following conclusions regard- 
ing event distributions. 

(i) Alternation silence (state 5): assuming A has not talked for 200 ms
prior to the state entry,* this is exponential ($\alpha^A_{\mathrm{alt}}$) for 0.2 s, and then
exponential ($\alpha^A_{\mathrm{alt}} + \alpha^B_{\mathrm{pau}}$).

(ii) Pause in isolation (state 4): these must be at least 0.2 s long, since
A cannot terminate state 4 until that time. Hence, these are constant-
plus-exponential distributed; constant = 0.2 s, exponential
parameter = ($\alpha^A_{\mathrm{pau}} + \alpha^B_{\mathrm{alt}}$).

(iii) Solitary talkspurt tends to be exponential ($\beta^A_{\mathrm{sol}} + \alpha^B_{\mathrm{int}}$); most
state 1 durations are unaffected by the minimum pause requirement.†

(iv) Double talk: states 2 and 3 distributions are completely unaffected 
by the minimum pause requirement, but their relative steady state 
probabilities may be changed somewhat, thus affecting the blend of the 
two density functions. The effect is probably slight, however, and the 
general shape of the distribution still looks very much as it did without 
the minimum pause length. This has the appearance of an exponential 
distribution, although not precisely exponential. 

(v) Mutual Silence: this distribution was predictable without the 
minimum pause requirement, but it now appears to be very complex 
and strongly affected by the 200 ms constant. All mutual silences which 
are "pauses in isolation" are at least 200 ms long, and those which are 
"alternation silences" usually start exponentially with parameter $\alpha^A_{\mathrm{alt}}$,
and after 200 ms they become exponential with parameter ($\alpha^A_{\mathrm{alt}} + \alpha^B_{\mathrm{pau}}$).
Figure 5 of Ref. 5 clearly shows the importance of the 200 ms constant 
in mutual silences. 

(vi) Remaining events are too complex to predict. Certainly, however, 
all talkspurts start with a 15 ms constant duration, and pauses start 
with a 200 ms constant duration. 

* Ref. 5 data suggest that less than 10 percent of state 5 intervals begin within
200 ms of A's speech.

† Based on data from Ref. 5, we estimate that only about 6 or 7 percent of
state 1 intervals begin within 200 ms of B's speech.

3.3 Exponential Approximation to Talkspurts and Pauses

Exponential and constant-plus-exponential events are easy to simulate.
If one wanted to generate artificial talkspurts and pauses and
was unconcerned with speaker interaction, could he use such a
simplified model? We tried such a fit to the empirical talkspurt and pause
distributions of the conversations described in Ref. 5.

Talkspurts were fit by a straight exponential distribution, without
a 15 ms constant, in which the exponential parameter was deduced
from the average event length. That is, for a particular speaker let

    $\beta_{ts} = 1/(\text{average talkspurt length})$;    (6)

then

    $\Pr(T \le t) = 1 - \exp(-\beta_{ts}\, t)$  (cumulative function).    (7)

For pauses, we used a constant-plus-exponential. Let

    $\alpha_{\mathrm{pau}} = 1/(\text{average pause length} - 0.2)$,    (8)

that is, the reciprocal of the average of the above-200-ms part of all pauses.
Then

    $\Pr(T \le t) = \begin{cases} 0 & \text{for } 0 \le t \le 0.2 \\ 1 - \exp[-\alpha_{\mathrm{pau}}(t - 0.2)] & \text{for } t > 0.2. \end{cases}$    (9)

For comparing distribution functions, a Kolmogorov-Smirnov test 
was used to see if the empirical distribution function came from the 
particular exponential function based on $\beta_{ts}$ or $\alpha_{\mathrm{pau}}$ (Ref. 10). Once again, the
statistical test is not strictly appropriate, since the mean of the ex- 
ponential function is forced equal to the sample average. It still ap- 
pears, however, to be a reasonable heuristic method to determine if 
the "shape" of the curve is exponential. Only four out of 32 sets of 
talkspurts fail the test, suggesting that the exponential model is a good 
approximation for talkspurts. This is in agreement with the findings 
of Jaffe and others (Ref. 2). None of the pauses fit constant-plus-exponential.
This probably results from trying to fit one distribution to two dis- 
tinctly different kinds of pause: pause in isolation, which occurs be- 
tween words and is short, and the long silence which occurs when 
listening to the other speaker. 
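The fitting procedure of equations (6) through (9) and the Kolmogorov-Smirnov comparison can be sketched as follows. This is an illustrative reconstruction, not the program used for the paper: the pause lengths are made up, the function names are hypothetical, and the critical value 1.36/sqrt(n) is the usual large-sample approximation for the 0.05 level, which is only rough for a sample this small.

```python
import math

def exp_cdf(t, rate):
    """Equation (7): exponential cumulative function."""
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0

def const_plus_exp_cdf(t, rate, const=0.2):
    """Equation (9): zero up to `const`, exponential beyond it."""
    return 1.0 - math.exp(-rate * (t - const)) if t > const else 0.0

def ks_statistic(sample, cdf):
    """Largest gap between the empirical and the fitted cumulative function."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap

# Made-up pause lengths in seconds, for illustration only.
pauses = [0.25, 0.30, 0.45, 0.60, 0.90, 1.30]
mean_pause = sum(pauses) / len(pauses)
alpha_fit = 1.0 / (mean_pause - 0.2)              # equation (8)
d_pause = ks_statistic(pauses, lambda t: const_plus_exp_cdf(t, alpha_fit))
critical = 1.36 / math.sqrt(len(pauses))          # large-sample 0.05 value
print("D =", round(d_pause, 3), "critical ~", round(critical, 3))
```

As the text notes, forcing the fitted mean to equal the sample average makes the test heuristic rather than exact; the statistic is still a convenient measure of how far the shape departs from the fitted curve.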

The good exponential fit to talkspurts might cause one to feel that 
the talkspurts could be modeled by a single parameter Poisson process. 
This would be achieved by having a single "talk" state, instead of 
three; once the state is entered, the speaker would ignore the other's 
speech and stop talking when his single parameter $\beta$-pulse occurred.
Although a reasonable talkspurt fit would be achieved, other speech
events, such as double talk and interruptions, would be poorly
matched for most speakers. This is true because the measured values
of the three $\beta$'s of Fig. 3 are generally quite different, with the two
double talking $\beta$'s often different from each other and typically at
least twice $\beta_{\mathrm{sol}}$ (see Table I). A single parameter Poisson process
would incorrectly assume these $\beta$'s to be equal to each other.

Why, then, do we get a good exponential fit to the general talk-
spurt distribution? Table II of Ref. 5, for -40 dBm threshold, shows
that state 1 accounts for about 88 percent of A's talking time, so that
the different $\beta$'s during double talking exert only a minor effect upon
the predominant state 1 single parameter Poisson process.* That is,
the long and frequent state 1 intervals tend to obliterate the fine
structure of the double talks.
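The 88 percent figure follows from simple arithmetic on the percentages quoted from Ref. 5 (24.99 percent mutual silence and 4.62 percent double talk). The short sketch below repeats that calculation; the equal split of single-talker time between the two speakers is the approximation stated in the paper's footnote, and the variable names are hypothetical.

```python
MUTUAL_SILENCE = 24.99   # percent of time, from Table II of Ref. 5
DOUBLE_TALK = 4.62       # percent of time, from Table II of Ref. 5

# Time in which exactly one speaker talks (states 1 and 6 at A's side).
one_talker = 100.0 - MUTUAL_SILENCE - DOUBLE_TALK

# Assume A and B share the single-talker time about equally.
state_1 = one_talker / 2.0

# A's total talking time is state 1 plus double talk.
share_of_talking = state_1 / (state_1 + DOUBLE_TALK)

print("one talker:", round(one_talker, 2), "percent")
print("state 1 share of A's talking time:", round(100 * share_of_talking), "percent")
```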

3.4 Connection of Two Models Over Special Circuits 

When A and B are directly connected, equations at A's side are 
easily written because knowledge of A's state at a random instant 
implies knowledge of B's state. (Once again, for simplicity, assume a 
Markovian model with no minimum event length requirements.) 
Analysis becomes very difficult when the circuit prohibits such
knowledge. Two such circuits are considered here: circuits with
transmission delay and circuits with echo suppressors.

3.4.1 Delay 

The feasibility of transmitting two-way telephone calls over satellite 
circuits has generated widespread interest in the effects of transmission 
delay on the behavior of the conversants. We have previously dealt with 
a system which connected two three-state Markovian devices over a 
channel with transmission delay (Refs. 4, 11). The following conclusions are of
interest here. 

(i) If the delay is "short" (in the order of average pause lengths or
less, as occurs in cases of practical interest, where $D \le 1200$ ms), an
exact analysis has not yet been found, and approximations are required
to solve even the simple three-state system.

(ii) For very long delays, asymptotic system behavior of the model
is obtainable; but the model is of doubtful validity since an entirely 
different kind of speech behavior might result from excessively long 
delays. 

* One or the other speaker, but not both, talks for 100 - 24.99 (mutual silence)
- 4.62 (double talk) = 70.39 percent of the time; this accounts for states 1 and
6 at A's side. State 1 is occupied about half this time, or 35.20 percent. This is
88 percent of 35.20 + 4.62, which is A's talking time.

3.4.2 Echo Suppressors

For our purposes, echo suppressors are devices which occasionally 
block the A to B or B to A (or both) transmission paths, at times 
depending on the interaction of the A and B speech patterns. (For 
further details on echo suppressors see Ref. 12.) There may also be 
delay, but even without delay the time dependency and uncertainty in 
the system is apparent and virtually prohibits formal analysis. 

For both delay and echo suppressors, simulation is not difficult (the 
author's simulator already incorporates delay) and provides at present 
the only means of assessing the performance of the model. 

3.5 Summary 

The six-state model described by the author contains time depend- 
encies which prevent formal Markovian analysis, but there is a 
tendency for the speech events to be formed from exponential, and in 
some cases, constant-plus-exponential "building blocks." Practically 
all of the exponential blocks or exponential parts of the constant-plus- 
exponential blocks have distributions with parameters equal to the 
sum of the A and B "exit probability" parameters; and even those 
events which seem exclusively a result of one speaker (such as solitary 
talkspurts) are in fact influenced by both speakers in a predictable 
way. 

Although several theoretical results are obtainable, one is forced 
to turn to simulation for complete quantitative results. The ease by 
which the model is simulated helps compensate for the numerous com- 
puter runs required for studying model behavior as a function of 
parameter or circuit changes.

APPENDIX

Distribution of a State Terminated By a Particular Speaker 

This appendix is a derivation of the result stated in Section 3.1, that
if, for example, one considers only those state 1 intervals terminated
by A, these will have an exponential distribution with parameter
($\beta^A_{\mathrm{sol}} + \alpha^B_{\mathrm{int}}$). For shorthand, we call the parameters $\beta$ and $\alpha$. Let state 1
begin at time t = 0. The joint probability that it is t s long and
terminated by A is:

    $\Pr(\text{terminated by A in } (t, t + dt)) = e^{-(\alpha + \beta)t}\, \beta\, dt$;    (10)

that is, neither an $\alpha$-pulse nor a $\beta$-pulse can occur in (0, t), and one
$\beta$-pulse must occur in (t, t + dt). Integrating equation (10) over all t,

    $\Pr(\text{state terminated by A at any time}) = \int_0^\infty \beta\, e^{-(\alpha + \beta)t}\, dt = \beta/(\alpha + \beta)$,    (11)

as it should. We desire the conditional probability that state 1 ends in
(t, t + dt) given that it is terminated by A. By Bayes' rule,

    $\Pr(\text{state ends in } (t, t + dt) \mid \text{terminated by A})$
        $= \dfrac{\text{joint } \Pr(\text{state ends in } (t, t + dt) \text{ and is terminated by A})}{\Pr(\text{terminated by A})}$
        $= \text{equation (10)}/\text{equation (11)} = e^{-(\alpha + \beta)t}\, (\alpha + \beta)\, dt$,    (12)

which is recognized as an exponential density function with parameter
($\alpha + \beta$).